Problem Description:

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective¶

To predict whether a liability customer will buy a personal loan, to determine which variables are most significant, and to identify which segment of customers should be targeted.

In [250]:
#import libraries
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Command to tell Python to actually display the graphs
%matplotlib inline

# Model building and evaluation
from sklearn import metrics, tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
In [251]:
path = "Loan_Modelling.csv"
data = pd.read_csv(path)
In [252]:
data.shape
Out[252]:
(5000, 14)
In [253]:
data.head(10)
Out[253]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
5 6 37 13 29 92121 4 0.4 2 155 0 0 0 1 0
6 7 53 27 72 91711 2 1.5 2 0 0 0 0 1 0
7 8 50 24 22 93943 1 0.3 3 0 0 0 0 0 1
8 9 35 10 81 90089 3 0.6 2 104 0 0 0 1 0
9 10 34 9 180 93023 1 8.9 3 0 1 0 0 0 0
In [254]:
#dropping ID column since it shouldn't be used in our predictions
data.drop('ID',axis=1,inplace=True)
data
Out[254]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

5000 rows × 13 columns

In [255]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 5000 non-null   int64  
 1   Experience          5000 non-null   int64  
 2   Income              5000 non-null   int64  
 3   ZIPCode             5000 non-null   int64  
 4   Family              5000 non-null   int64  
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64  
 7   Mortgage            5000 non-null   int64  
 8   Personal_Loan       5000 non-null   int64  
 9   Securities_Account  5000 non-null   int64  
 10  CD_Account          5000 non-null   int64  
 11  Online              5000 non-null   int64  
 12  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(12)
memory usage: 507.9 KB
In [256]:
data.describe(include='all').T
Out[256]:
count mean std min 25% 50% 75% max
Age 5000.0 45.338400 11.463166 23.0 35.0 45.0 55.0 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.0 20.0 30.0 43.0
Income 5000.0 73.774200 46.033729 8.0 39.0 64.0 98.0 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.0 93437.0 94608.0 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 1.881000 0.839869 1.0 1.0 2.0 3.0 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.0 0.0 101.0 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.0 0.0 0.0 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.0 0.0 0.0 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.0 0.0 0.0 1.0
Online 5000.0 0.596800 0.490589 0.0 0.0 1.0 1.0 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.0 0.0 1.0 1.0
In [257]:
data
Out[257]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ...
4995 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 28 4 83 92612 3 0.8 1 0 0 0 0 1 1

5000 rows × 13 columns

EDA¶

We want to see how our independent variables relate to our dependent variable, Personal_Loan.

In [258]:
sns.histplot(data=data, x='Age');

There is no clear pattern here.

In [259]:
sns.histplot(data=data, x='Experience');
In [260]:
sns.histplot(data=data, x='Income');

Income appears to be right skewed

In [261]:
sns.histplot(data=data, x='ZIPCode');
In [262]:
sns.histplot(data=data, x='Family');
In [263]:
sns.histplot(data=data, x='CCAvg');

CCAvg appears to be right skewed

In [264]:
sns.histplot(data=data, x='Education');

The largest group in terms of education is undergraduates

In [265]:
sns.histplot(data=data, x='Mortgage');

Most people don't have a mortgage

In [266]:
sns.histplot(data=data, x='Securities_Account');

Most customers don't have a securities account with the bank

In [267]:
sns.histplot(data=data, x='CD_Account');

Most customers don't have a CD account with the bank

In [268]:
sns.histplot(data=data, x='Online');

Most customers use online banking

In [269]:
sns.histplot(data=data, x='CreditCard');

Most customers (about 70%) do not have a credit card issued by AllLife Bank

Some independent variables show interesting results when graphed with a box plot; we are able to see some outliers.

In [270]:
sns.boxplot(data=data, x="Income")
Out[270]:
<AxesSubplot:xlabel='Income'>
In [271]:
sns.boxplot(data=data, x="CCAvg")
Out[271]:
<AxesSubplot:xlabel='CCAvg'>

Customers whose average monthly credit card spending is above roughly 5,000 dollars are flagged as outliers. Most customers are on the lower end of credit card expenditures, so there are a lot of outliers on the high side.
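The boxplot flags any point beyond 1.5 × IQR from the quartiles. A minimal sketch of that rule, using synthetic right-skewed spending data as a stand-in for the real CCAvg column:

```python
import numpy as np
import pandas as pd

# Sketch: a boxplot whisker ends at Q3 + 1.5 * IQR; anything beyond that
# fence is drawn as an outlier. Synthetic right-skewed data stands in for CCAvg.
rng = np.random.default_rng(0)
spend = pd.Series(rng.exponential(scale=1.8, size=5000))

q1, q3 = spend.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = spend[spend > upper_fence]
print(f"upper fence: {upper_fence:.2f}, points flagged: {len(outliers)}")
```

Counting points this way quantifies what the boxplot only shows visually.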

In [272]:
sns.boxplot(data=data, x="Mortgage")
Out[272]:
<AxesSubplot:xlabel='Mortgage'>

This makes sense because most people don't have a mortgage; the customers who do show up as outliers.

In [273]:
sns.pairplot(data,diag_kind='kde')
Out[273]:
<seaborn.axisgrid.PairGrid at 0x7fa3cdf52df0>
In [274]:
plot_corr(data)  # helper defined in the Logistic Regression section below; run that cell first

Plotting the correlations, we see that Age and Experience are almost perfectly correlated (r ≈ 0.99), the strongest relationship between any pair of variables. Income also has a significant correlation with CCAvg (r ≈ 0.65).
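Experience behaves roughly like Age minus years of schooling, which is why the pair is nearly collinear; one of them could be dropped with little loss. A sketch on synthetic columns mimicking that relationship:

```python
import numpy as np
import pandas as pd

# Sketch (synthetic): Experience generated as Age minus ~21 years plus noise,
# mimicking the near-perfect correlation seen in the real data.
rng = np.random.default_rng(2)
age = rng.integers(23, 68, size=5000)
experience = age - 21 + rng.integers(-2, 3, size=5000)
df = pd.DataFrame({"Age": age, "Experience": experience})

r = df["Age"].corr(df["Experience"])
print(f"corr(Age, Experience) = {r:.3f}")

# With |r| this close to 1, the second column adds almost no information;
# a common simplification is to keep Age and drop Experience.
reduced = df.drop(columns="Experience")
```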

Split Data¶

In [276]:
X = data.drop("Personal_Loan", axis=1)
Y = data.pop("Personal_Loan")  # pop also removes the target column from `data` itself
In [277]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.30, random_state=1)
In [278]:
#check the split
print("{0:0.2f}% data is in training set".format((len(X_train)/len(data.index)) * 100))
print("{0:0.2f}% data is in test set".format((len(X_test)/len(data.index)) * 100))
70.00% data is in training set
30.00% data is in test set

We confirm the split is 70/30, as specified by test_size=0.30.
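With only ~9.6% positives, passing `stratify=y` to `train_test_split` keeps that rate identical in both halves, whereas the plain random split above only preserves it approximately. A self-contained sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sketch: with a rare positive class (~9.6%, as in this data), stratify=y
# keeps the class balance the same in both halves of the split.
X_toy = np.arange(5000).reshape(-1, 1)
y_toy = np.array([1] * 480 + [0] * 4520)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=1, stratify=y_toy
)
print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```

This matters most for rare-event targets, where an unlucky split could leave the test set with too few positives to estimate recall reliably.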

Build Decision Tree Model¶

In [279]:
dTree = DecisionTreeClassifier(criterion = 'gini', random_state=1)
dTree.fit(X_train, Y_train)
Out[279]:
DecisionTreeClassifier(random_state=1)
In [280]:
print("Accuracy on training set : ",dTree.score(X_train, Y_train))
print("Accuracy on test set : ",dTree.score(X_test, Y_test))
Accuracy on training set :  1.0
Accuracy on test set :  0.98

Test accuracy (0.98) is close to training accuracy, which is what we want. A training accuracy of exactly 1.0, however, suggests the unpruned tree has memorized the training data, so some overfitting is likely.

In [281]:
#Checking number of positives
Y.sum(axis = 0)
Out[281]:
480

480 of the 5,000 customers in the full dataset (9.6%) accepted the personal loan offer. Note this is the count of positives in Y as a whole, not in the 1,500-row test set.

Consideration: we may not need such a high degree of accuracy. The cost of sending an offer that a customer rejects is very low, but every customer who would have accepted and doesn't receive an offer is lost potential revenue for the bank.

In [282]:
480/1500  # note: mixes the full-dataset positive count with the test-set size
Out[282]:
0.32

Note this ratio mixes bases: the 480 positives come from all 5,000 customers, while 1,500 is only the test-set size. The population take-up rate is 480/5,000 = 9.6%, so the model should flag far fewer than 32% of test customers as likely buyers.
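For reference, the two bases give very different rates (a quick arithmetic check, not new model output):

```python
# 480 positives were counted over the full 5,000-row dataset; 1,500 is
# only the size of the test split, so the two ratios answer different questions.
total_positives = 480
n_total = 5000
n_test = 1500

overall_rate = total_positives / n_total   # population take-up rate
mixed_rate = total_positives / n_test      # mixes dataset count with test size

print(f"overall take-up rate: {overall_rate:.1%}")  # 9.6%
print(f"480 / 1500:           {mixed_rate:.1%}")    # 32.0%
```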

Adding some helper methods to calculate recall and generate a confusion matrix

In [283]:
## Function to create confusion matrix
def make_confusion_matrix(model, y_actual):
    '''
    model    : fitted classifier; predictions are made on the global X_test
    y_actual : ground-truth labels for X_test
    '''
    y_predict = model.predict(X_test)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=[0, 1])
    df_cm = pd.DataFrame(cm, index=["Actual - No", "Actual - Yes"],
                         columns=['Predicted - No', 'Predicted - Yes'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in
                         cm.flatten() / np.sum(cm)]
    labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=(10, 7))
    sns.heatmap(df_cm, annot=labels, fmt='')
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
In [284]:
##  Function to calculate recall score
def get_recall_score(model):
    '''
    model : classifier to predict values of X

    '''
    pred_train = model.predict(X_train)
    pred_test = model.predict(X_test)
    print("Recall on training set : ",metrics.recall_score(Y_train,pred_train))
    print("Recall on test set : ",metrics.recall_score(Y_test,pred_test))

Recall is a better metric than accuracy here because it ensures we capture the true positives, even if that means more false positives. We don't care much about accidentally sending offers to customers who will reject them: the cost of sending an offer is very low, while the benefit of reaching an extra customer who accepts is high.
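A toy example of why accuracy misleads on a rare-event target like this one:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Sketch: with a ~10% positive class, a model that never predicts "accept"
# still reaches 90% accuracy while catching zero buyers.
y_true = np.array([1] * 10 + [0] * 90)
y_never = np.zeros(100, dtype=int)  # always predicts "no loan"

acc = accuracy_score(y_true, y_never)
rec = recall_score(y_true, y_never)
print(f"accuracy: {acc:.2f}, recall: {rec:.2f}")  # accuracy: 0.90, recall: 0.00
```

Recall exposes the failure that accuracy hides, which is exactly the failure mode the campaign cares about.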

In [285]:
#generate a confusion matrix
make_confusion_matrix(dTree,Y_test)
In [286]:
# Recall on train and test
get_recall_score(dTree)
Recall on training set :  1.0
Recall on test set :  0.8859060402684564

Visualizing the Decision Tree¶

In [287]:
feature_names = list(X.columns)
print(feature_names)
['Age', 'Experience', 'Income', 'ZIPCode', 'Family', 'CCAvg', 'Education', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard']
In [288]:
plt.figure(figsize=(20,30))
tree.plot_tree(dTree,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [289]:
print(tree.export_text(dTree,feature_names=feature_names,show_weights=True))
|--- Income <= 116.50
|   |--- CCAvg <= 2.95
|   |   |--- Income <= 106.50
|   |   |   |--- weights: [2553.00, 0.00] class: 0
|   |   |--- Income >  106.50
|   |   |   |--- Family <= 3.50
|   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |--- weights: [35.00, 0.00] class: 0
|   |   |   |   |--- Education >  1.50
|   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |--- Age <= 41.50
|   |   |   |   |   |   |   |--- weights: [16.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  41.50
|   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |--- Family >  3.50
|   |   |   |   |--- Experience <= 3.50
|   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |--- Experience >  3.50
|   |   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |   |--- Experience <= 7.00
|   |   |   |   |   |   |   |--- Age <= 29.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  29.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- Experience >  7.00
|   |   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |   |--- Age >  60.00
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |--- CCAvg <= 3.35
|   |   |   |   |   |   |   |--- Experience <= 13.00
|   |   |   |   |   |   |   |   |--- Age <= 33.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  33.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Experience >  13.00
|   |   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |   |--- weights: [23.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.35
|   |   |   |   |   |   |   |--- Family <= 3.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Family >  3.00
|   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |   |--- weights: [43.00, 0.00] class: 0
|   |   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |   |--- Income <= 83.50
|   |   |   |   |   |   |   |   |--- Age <= 45.50
|   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  45.50
|   |   |   |   |   |   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age >  54.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Income >  83.50
|   |   |   |   |   |   |   |   |--- weights: [24.00, 0.00] class: 0
|   |   |   |--- CD_Account >  0.50
|   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- Education <= 1.50
|   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Age <= 55.00
|   |   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  55.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Online >  0.50
|   |   |   |   |   |   |   |--- weights: [20.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- Experience <= 21.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Experience >  21.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |--- Income <= 93.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  93.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |--- Education >  1.50
|   |   |   |   |--- Age <= 63.50
|   |   |   |   |   |--- Mortgage <= 172.00
|   |   |   |   |   |   |--- CD_Account <= 0.50
|   |   |   |   |   |   |   |--- Age <= 60.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 21.00] class: 1
|   |   |   |   |   |   |   |--- Age >  60.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.75
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg >  3.75
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- CD_Account >  0.50
|   |   |   |   |   |   |   |--- Family <= 2.00
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family >  2.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Mortgage >  172.00
|   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- Age >  63.50
|   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|--- Income >  116.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- weights: [375.00, 0.00] class: 0
|   |   |--- Family >  2.50
|   |   |   |--- weights: [0.00, 47.00] class: 1
|   |--- Education >  1.50
|   |   |--- weights: [0.00, 222.00] class: 1

In [290]:
# What is the importance of each feature in the tree
print (pd.DataFrame(dTree.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.401465
Income              0.308336
Family              0.169593
CCAvg               0.044408
Age                 0.035708
CD_Account          0.025711
Experience          0.011203
Mortgage            0.003014
Online              0.000561
ZIPCode             0.000000
Securities_Account  0.000000
CreditCard          0.000000

The customers' ZIP codes, and whether or not they had a securities account or a credit card, had no predictive power for whether the customer would accept a personal loan offer.

Experience, CD account ownership, Age, and CCAvg also had very little importance.
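Features the tree never splits on get exactly zero Gini importance and can be dropped. A sketch of that filter on synthetic data (not the bank's file), with a deliberately irrelevant column:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Sketch: a feature the tree never splits on gets exactly zero importance,
# as ZIPCode, Securities_Account and CreditCard do in the real model.
rng = np.random.default_rng(3)
X_toy = pd.DataFrame({
    "Income": rng.normal(75, 45, 1000),
    "Noise": rng.normal(0, 1, 1000),   # unrelated to the target
})
y_toy = (X_toy["Income"] > 116).astype(int)

clf = DecisionTreeClassifier(random_state=1).fit(X_toy, y_toy)
useless = X_toy.columns[clf.feature_importances_ == 0].tolist()
print("zero-importance features:", useless)
```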

In [291]:
importances = dTree.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12,12))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

According to the graph above, Education, Income, and Family are the most important variables. The bank should focus on these when launching its campaigns.

In [292]:
# To reduce the complexity of the tree we will set the max depth limit to 5

dTree1 = DecisionTreeClassifier(criterion = 'gini',max_depth=5,random_state=1)
dTree1.fit(X_train, Y_train)
Out[292]:
DecisionTreeClassifier(max_depth=5, random_state=1)
In [293]:
plt.figure(figsize=(15,10))

tree.plot_tree(dTree1,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()
In [294]:
# Recall on the training set is still high, with slightly lower recall on the test set; this is acceptable because we reduced overfitting.
get_recall_score(dTree1)
Recall on training set :  0.9516616314199395
Recall on test set :  0.8791946308724832
In [295]:
print (pd.DataFrame(dTree1.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.438816
Income              0.325524
Family              0.156494
CCAvg               0.041313
CD_Account          0.024794
Experience          0.009097
Age                 0.003963
ZIPCode             0.000000
Mortgage            0.000000
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000

When reducing the depth of the tree, the importance of Education and Income increases while the importance of family size decreases.

In [297]:
importances = dTree1.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(10,10))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='violet', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

We can see that Education, Income, and Family keep the same order of importance, while Age decreased in importance.

Next we will focus on hyperparameter tuning in an attempt to improve the model.

In [ ]:
from sklearn.model_selection import GridSearchCV
In [ ]:
# Choose the type of classifier. 
estimator = DecisionTreeClassifier(random_state=1)

# Grid of parameters to choose from
parameters = {'max_depth': np.arange(1,10), 
              'min_samples_leaf': [1, 2, 5, 7, 10,15,20],
              'max_leaf_nodes' : [2, 3, 5, 10],
              'min_impurity_decrease': [0.001,0.01,0.1]
             }

# Score parameter combinations by recall, since that is the metric we care about
scorer = metrics.make_scorer(metrics.recall_score)

# Run the grid search using 5-fold cross-validation
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, Y_train)

# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data. 
estimator.fit(X_train, Y_train)
In [ ]:
make_confusion_matrix(estimator,Y_test)

The confusion matrix

True Positives (TP) - 131

True Negatives (TN) - 1341

False Positives (FP) - Type I error - 10

False Negatives (FN) - Type II error - 18

In [302]:
#accuracy
print("Accuracy on training set : ",estimator.score(X_train, Y_train))
print("Accuracy on test set : ",estimator.score(X_test, Y_test))
#recall
get_recall_score(estimator)
Accuracy on training set :  0.9897142857142858
Accuracy on test set :  0.9813333333333333
Recall on training set :  0.9274924471299094
Recall on test set :  0.8791946308724832

We achieve a high recall (0.879) on the test set.

In [303]:
plt.figure(figsize=(15,10))

tree.plot_tree(estimator,feature_names=feature_names,filled=True,fontsize=9,node_ids=True,class_names=True)
plt.show()

The tree we create here is simpler than the ones we started with and is more explainable.

In [304]:
#printing the gini importance of each variable to what we are predicting

print (pd.DataFrame(estimator.feature_importances_, columns = ["Imp"], index = X_train.columns).sort_values(by = 'Imp', ascending = False))
                         Imp
Education           0.447999
Income              0.328713
Family              0.155711
CCAvg               0.042231
CD_Account          0.025345
Age                 0.000000
Experience          0.000000
ZIPCode             0.000000
Mortgage            0.000000
Securities_Account  0.000000
Online              0.000000
CreditCard          0.000000
In [305]:
# analyze candidate ccp_alphas and Gini impurities to seek further gains
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, Y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
In [306]:
pd.DataFrame(path)
Out[306]:
ccp_alphas impurities
0 0.000000 0.000000
1 0.000223 0.001114
2 0.000268 0.002188
3 0.000359 0.003263
4 0.000381 0.003644
5 0.000381 0.004025
6 0.000381 0.004406
7 0.000381 0.004787
8 0.000409 0.006423
9 0.000476 0.006900
10 0.000508 0.007407
11 0.000582 0.007989
12 0.000593 0.009175
13 0.000641 0.011740
14 0.000769 0.014817
15 0.000792 0.017985
16 0.001552 0.019536
17 0.002333 0.021869
18 0.003024 0.024893
19 0.003294 0.028187
20 0.006473 0.034659
21 0.023866 0.058525
22 0.056365 0.171255
In [307]:
fig, ax = plt.subplots(figsize=(10,5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker='o', drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
In [308]:
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, Y_train)
    clfs.append(clf)
print("Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
      clfs[-1].tree_.node_count, ccp_alphas[-1]))
Number of nodes in the last tree is: 1 with ccp_alpha: 0.056364969335601575
In [309]:
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1,figsize=(10,7))
ax[0].plot(ccp_alphas, node_counts, marker='o', drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker='o', drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
In [317]:
train_scores = [clf.score(X_train, Y_train) for clf in clfs]
test_scores = [clf.score(X_test, Y_test) for clf in clfs]
In [318]:
fig, ax = plt.subplots(figsize=(10,5))
ax.set_xlabel("alpha")
ax.set_ylabel("accuracy")
ax.set_title("Accuracy vs alpha for training and testing sets")
ax.plot(ccp_alphas, train_scores, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, test_scores, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [320]:
index_best_model = np.argmax(test_scores)
best_model = clfs[index_best_model]
print(best_model)
print('Training accuracy of best model: ',best_model.score(X_train, Y_train))
print('Test accuracy of best model: ',best_model.score(X_test, Y_test))
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
Training accuracy of best model:  0.9928571428571429
Test accuracy of best model:  0.984

Accuracy is high, but recall is the more important metric for this problem.

In [322]:
recall_train=[]
for clf in clfs:
    pred_train3=clf.predict(X_train)
    values_train=metrics.recall_score(Y_train,pred_train3)
    recall_train.append(values_train)
In [323]:
recall_test=[]
for clf in clfs:
    pred_test3=clf.predict(X_test)
    values_test=metrics.recall_score(Y_test,pred_test3)
    recall_test.append(values_test)
In [324]:
fig, ax = plt.subplots(figsize=(15,5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker='o', label="train",
        drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker='o', label="test",
        drawstyle="steps-post")
ax.legend()
plt.show()
In [325]:
# select the model with the highest recall on the test set
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006414326414326415, random_state=1)
In [326]:
make_confusion_matrix(best_model,Y_test)
In [327]:
# Recall on train and test
get_recall_score(best_model)
Recall on training set :  0.9667673716012085
Recall on test set :  0.9060402684563759

This is the highest test-set recall (0.906) among the decision trees we built.

Logistic Regression¶

In [310]:
#generate a correlation matrix (note: Personal_Loan is absent because pop() removed it from data earlier)
data.corr()
Out[310]:
Age Experience Income ZIPCode Family CCAvg Education Mortgage Securities_Account CD_Account Online CreditCard
Age 1.000000 0.994215 -0.055269 -0.030530 -0.046418 -0.052012 0.041334 -0.012539 -0.000436 0.008043 0.013702 0.007681
Experience 0.994215 1.000000 -0.046574 -0.030456 -0.052563 -0.050077 0.013152 -0.010582 -0.001232 0.010353 0.013898 0.008967
Income -0.055269 -0.046574 1.000000 -0.030709 -0.157501 0.645984 -0.187524 0.206806 -0.002616 0.169738 0.014206 -0.002385
ZIPCode -0.030530 -0.030456 -0.030709 1.000000 0.027512 -0.012188 -0.008266 0.003614 0.002422 0.021671 0.028317 0.024033
Family -0.046418 -0.052563 -0.157501 0.027512 1.000000 -0.109275 0.064929 -0.020445 0.019994 0.014110 0.010354 0.011588
CCAvg -0.052012 -0.050077 0.645984 -0.012188 -0.109275 1.000000 -0.136124 0.109905 0.015086 0.136534 -0.003611 -0.006689
Education 0.041334 0.013152 -0.187524 -0.008266 0.064929 -0.136124 1.000000 -0.033327 -0.010812 0.013934 -0.015004 -0.011014
Mortgage -0.012539 -0.010582 0.206806 0.003614 -0.020445 0.109905 -0.033327 1.000000 -0.005411 0.089311 -0.005995 -0.007231
Securities_Account -0.000436 -0.001232 -0.002616 0.002422 0.019994 0.015086 -0.010812 -0.005411 1.000000 0.317034 0.012627 -0.015028
CD_Account 0.008043 0.010353 0.169738 0.021671 0.014110 0.136534 0.013934 0.089311 0.317034 1.000000 0.175880 0.278644
Online 0.013702 0.013898 0.014206 0.028317 0.010354 -0.003611 -0.015004 -0.005995 0.012627 0.175880 1.000000 0.004210
CreditCard 0.007681 0.008967 -0.002385 0.024033 0.011588 -0.006689 -0.011014 -0.007231 -0.015028 0.278644 0.004210 1.000000
In [311]:
def plot_corr(df, size=11):
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    ax.matshow(corr)
    plt.xticks(range(len(corr.columns)), corr.columns)
    plt.yticks(range(len(corr.columns)), corr.columns)
    for (i, j), z in np.ndenumerate(corr):
        ax.text(j, i, '{:0.1f}'.format(z), ha='center', va='center')
In [312]:
# To get different metric scores

from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    plot_confusion_matrix,
    precision_recall_curve,
    roc_curve,
)
In [313]:
from sklearn import metrics

from sklearn.linear_model import LogisticRegression

# Fit the model on train
model = LogisticRegression(solver="liblinear", random_state=1)
model.fit(X_train, Y_train)
#predict on test
Y_predict = model.predict(X_test)


coef_df = pd.DataFrame(model.coef_)
coef_df['intercept'] = model.intercept_
print(coef_df)
          0        1         2         3        4         5         6  \
0  0.001235 -0.00132  0.036132 -0.000067  0.01521  0.009387  0.016434   

          7         8         9        10        11  intercept  
0  0.000833  0.000529  0.004639 -0.000131 -0.000022  -0.000063  
In [314]:
model_score = model.score(X_test, Y_test)
print(model_score)

#model accuracy ~0.91; note that always predicting "no loan" would already score ~0.90
0.9073333333333333
In [315]:
cm=metrics.confusion_matrix(Y_test, Y_predict, labels=[1, 0])

df_cm = pd.DataFrame(cm, index = [i for i in ["Actual 1"," Actual 0"]],
                  columns = [i for i in ["Predict 1","Predict 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True,fmt='g')
plt.show()

The confusion matrix for logistic regression:

True Positives (TP) - 43

True Negatives (TN) - 1318

False Positives (FP) - Type I error - 33

False Negatives (FN) - Type II error - 106

The confusion matrix for the previous decision tree:

True Positives (TP) - 131

True Negatives (TN) - 1341

False Positives (FP) - Type I error - 10

False Negatives (FN) - Type II error - 18

The decision tree had far fewer Type I and Type II errors than logistic regression.
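Recomputing recall and precision from the confusion-matrix counts quoted above (a quick arithmetic check, not new model output) makes the gap concrete:

```python
# Derive recall and precision directly from the TP/FP/FN counts listed above.
def recall_precision(tp, fp, fn):
    return tp / (tp + fn), tp / (tp + fp)

tree_recall, tree_precision = recall_precision(tp=131, fp=10, fn=18)
logit_recall, logit_precision = recall_precision(tp=43, fp=33, fn=106)

print(f"tree : recall={tree_recall:.3f}, precision={tree_precision:.3f}")
print(f"logit: recall={logit_recall:.3f}, precision={logit_precision:.3f}")
```

The tree recall of 131/149 ≈ 0.879 matches the recall reported for the tuned tree earlier, while logistic regression recovers under a third of the actual buyers.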

Comparison of models¶

Both models performed reasonably well, but the tuned decision tree achieved much higher recall.

The business would likely want to target customers with high education levels and high incomes.

The business can largely ignore customers' account types (securities accounts, online banking, credit cards), as these had little effect on whether they accepted the loan offer in the campaign.